Extraction of Indicative Summary Sentences from Imaged Documents

Authors

  • Francine Chen
  • Dan S. Bloomberg
Abstract

A system is presented for selecting sentences from an imaged document for presentation as part of a document summary. The extracts are identified without the use of optical character recognition. The sentences are selected based on a set of discrete features characterizing the words within a sentence and the location of the sentence within the imaged document. Each sentence is scored based on the values of the discrete features using a statistically based classifier. The imaged document is processed to identify the word locations, the reading order of words, and the location of sentence and paragraph boundaries in the text. The words are grouped into equivalence classes to mimic the terms in a text document. A sample extract for a technical document is shown, and an evaluation against a set of abstracts created by a professional abstracting company is given. These results are compared with text-based abstracts.
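The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract says only that a statistically based classifier scores each sentence from discrete features; the naive Bayes formulation below is an assumption chosen for illustration, not necessarily the authors' classifier. The feature names (`position`, `has_frequent_word_class`), their values, and the training data are hypothetical.

```python
import math
from collections import defaultdict


def train(examples, alpha=1.0):
    """Count discrete feature values per class.

    examples: list of (features, in_summary) pairs, where features maps a
    feature name to a discrete value and in_summary is a bool label.
    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    class_counts = {True: 0, False: 0}
    feature_counts = defaultdict(int)  # (label, name, value) -> count
    values = defaultdict(set)          # name -> set of observed values
    for features, label in examples:
        class_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name, value)] += 1
            values[name].add(value)
    return class_counts, feature_counts, values, alpha


def score(model, features):
    """Return the log-odds that a sentence belongs in the summary."""
    class_counts, feature_counts, values, alpha = model
    total = sum(class_counts.values())
    log_odds = 0.0
    for label, sign in ((True, 1.0), (False, -1.0)):
        # Smoothed class prior.
        logp = math.log((class_counts[label] + alpha) / (total + 2 * alpha))
        # Features are treated as conditionally independent given the class.
        for name, value in features.items():
            num = feature_counts[(label, name, value)] + alpha
            den = class_counts[label] + alpha * max(len(values[name]), 1)
            logp += math.log(num / den)
        log_odds += sign * logp
    return log_odds


# Hypothetical training data: location and word-class features per sentence.
examples = [
    ({"position": "first_paragraph", "has_frequent_word_class": "yes"}, True),
    ({"position": "first_paragraph", "has_frequent_word_class": "no"}, True),
    ({"position": "body", "has_frequent_word_class": "no"}, False),
    ({"position": "body", "has_frequent_word_class": "yes"}, False),
    ({"position": "body", "has_frequent_word_class": "no"}, False),
]
model = train(examples)
lead_score = score(
    model, {"position": "first_paragraph", "has_frequent_word_class": "yes"}
)
body_score = score(model, {"position": "body", "has_frequent_word_class": "no"})
```

Sentences would then be ranked by score and the top few presented as the extract. In an image-based setting, a feature such as `has_frequent_word_class` would be computed from word-image equivalence classes rather than from recognized text.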


Similar resources

Multi-document Summarization System: Using Fuzzy Logic and Genetic Algorithm

In recent times, the generation of multi-document summaries has gained a lot of attention among researchers. Most text summarization techniques use sentence extraction, in which the salient sentences in the multiple documents are extracted and presented as a summary. In our proposed system, we have developed a sentence extraction based automatic multi-docu...


Discovering Salience in Textual Elements using Graph Mutual Reinforcement SI508 Project

The problem of identifying the most salient terms and/or sentences from a set of documents has gained great interest in recent years. Identifying the set of the most salient terms in a set of documents is usually called automatic keyword extraction or terminology extraction. Extracting the most salient set of sentences from a document or a set of documents is used for extractive summarization w...


Sentence Annotation based Enhanced Semantic Summary Generation from Multiple Documents

Problem statement: The goal of document summarization is to provide a summary or outline of multiple documents in less time. Sentence extraction is a technique used to pick out relevant and important sentences from documents and present them as a summary. There is therefore a need for a more meaningful sentence selection strategy that extracts the most significant sentences. A...


Content based Sentence Ordering using Spanning Tree Algorithm for Improved Multi Document Summarization

Because the required information is often spread across multiple documents on the web, summarizing these documents and efficiently ordering the sentences in the summary has become a relevant task in data mining. We present a novel sentence ordering method based on a maximum cost spanning tree algorithm to improve the readability and cohesion of the summary obtained by ext...


Multi-document summarization by cluster/profile relevance and redundancy removal

We describe a sentence extraction system that produces two sorts of multi-document summaries: the first is a general-purpose summary of a cluster of related documents while the second is an entity-based summary of documents related to a particular person. The general-purpose summary is generated by a process that ranks sentences based on their document and cluster “worthiness”. The personality-...



Publication date: 1997